Regular expressions

(This documentation is derived from Tom Lords documentation for the Free Software Foundation regular expression library)

Permission to use, copy, modify, distribute, and sell this documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation.

Posix Basic Regular Expressions

The Posix Basic Regular Expression language is a notation for describing textual patterns. Regexps are typically used by comparing them to a string to see if that string matches the pattern, or by searching within a string for a substring that matches.

This chapter introduces the Posix regexp notation. This is not a formal or precise definition of Posix regexps -- it is an intuitive and hopefully expository description of them.

An Introduction to Regexps

In the simplest cases, a regexp is just a literal string that must match exactly. For example, the pattern:

regexp

matches the string "regexp" and no others.

Some characters have a special meaning when they occur in a regexp. They aren't matched literally as in the previous example, but instead denote a more general pattern. For example, the character * is used to indicate that the preceeding element of a regexp may be repeated 0, 1, or more times. In the pattern:

smooo*th

the * indicates that the preceeding o can be repeated 0 or more times. So the pattern matches:

smooth
smoooth
smooooth
smoooooth
...

Suppose you want to write a pattern that literally matches a special character like * -- in other words, you don't want to * to indicate a permissible repetition, but to match * literally. This is accomplished by quoting the special character with a backslash. The pattern:

smoo\*th

matches the string:

smoo*th

and no other strings.

In seven cases, the pattern is reversed -- a backslash makes the character special instead of making a special character normal. The characters +, ?, |, (, and ) are normal but the sequences \+, \?, \|, \(, \), \{, and \} are special (their meaning is described later).

The remaining sections of this chapter introduce and explain the various special characters that can occur in regexps.

Literal Regexps

A literal regexp is a string which contains no special characters. A literal regexp matches an identical string, but no other characters. For example:

literally

matches

literally

and nothing else.

Generally, whitespace characters, numbers, and letters are not special. Some punctuation characters are special and some are not (the syntax summary at the end of this chapter makes a convenient reference for which characters are special and which aren't).

Character Sets

This section introduces the special characters . and [.

. matches any character except the NULL character. For example:

p.ck

matches

pick
pack
puck
pbck
pcck
p.ck

...

[ begins a character set. A character set is similar to . in that it matches not a single, literal character, but any of a set of characters. [ is different from . in that with [, you define the set of characters explicitly.

There are three basic forms a character set can take.

In the first form, the character set is spelled out:

[<cset-spec>]	-- every character in <cset-spec> is in the set.

In the second form, the character set indicated is the negation of a character set is explicitly spelled out:

[^<cset-spec>]	-- every character *not* in <cset-spec> is in the set.

A <cset-spec> is more or less an explicit enumeration of a set of characters. It can be written as a string of individual characters:

[aeiou]

or as a range of characters:

[0-9]

These two forms can be mixed:

[A-za-z0-9_$]

Note that special regexp characters (such as *) are not special within a character set. -, as illustrated above, is special, except, as illustrated below, when it is the first character mentioned.

This is a four-character set:

[-+*/]

The third form of a character set makes use of a pre-defined "character class":

[[:class-name:]] -- every character described by class-name is in the set.

The supported character classes are:

alnum	- the set of alpha-numeric characters
alpha	- the set of alphabetic characters
blank	- tab and space
cntrl	- the control characters
digit	- decimal digits
graph	- all printable characters except space
lower	- lower case letters
print	- the "printable" characters
punct	- punctuation
space	- whitespace characters
upper	- upper case letters
xdigit	- hexidecimal digits

Finally, character class sets can also be inverted:

[^[:space:]] - all non-whitespace characters

Character sets can be used in a regular expression anywhere a literal character can.

Subexpressions

A subexpression is a regular expression enclosed in \( and \). A subexpression can be used anywhere a single character or character set can be used.

Subexpressions are useful for grouping regexp constructs. For example, the repeat operator, *, usually applies to just the preceeding character. Recall that:

smooo*th

matches

smooth
smoooth
...

Using a subexpression, we can apply * to a longer string:

banan\(an\)*a

matches

banana
bananana
banananana
...

Subexpressions also have a special meaning with regard to backreferences and substitutions (see See Backreferences, Extractions and Substitutions).

Repeated Subexpressions

* is the repeat operator. It applies to the preceeding character, character set, subexpression or backreference. It indicates that the preceeding element can be matched 0 or more times:

bana\(na\)*

matches

bana
banana
bananana
banananana
...

\+ is similar to * except that \+ requires the preceeding element to be matched at least once. So while:

bana\(na\)*

matches

bana

bana(na\)\+

does not. Both match

banana
bananana
banananana
...

Thus, bana\(na\)+ is short-hand for banana\(na\)*.

Optional Subexpressions

\? indicates that the preceeding character, character set, or subexpression is optional. It is permitted to match, or to be skipped:

CSNY\?

matches both

CSN

and

CSNY

Counted Subexpressions

An interval expression, \{m,n\} where m and n are non-negative integers with n >= m, applies to the preceeding character, character set, subexpression or backreference. It indicates that the preceeding element must match at least m times and may match as many as n times.

For example:

c\([ad]\)\{1,4\}

matches

car
cdr
caar
cdar
...
caaar
cdaar
...
cadddr
cddddr

Alternative Subexpressions

An alternative is written:

regexp-1\|regexp-2\|regexp-3\|...

It matches anything matched by some regexp-n. For example:

Crosby, Stills, \(and Nash\|Nash, and Young\)

matches

Crosby, Stills, and Nash

and

Crosby, Stills, Nash, and Young

Backreferences, Extractions and Substitutions

A backreference is written \n where n is some single digit other than 0. To be a valid backreference, there must be at least n parenthesized subexpressions in the pattern prior to the backreference.

A backreference matches a literal copy of whatever was matched by the corresponding subexpression. For example,

\(.*\)-\1

matches:

go-go
ha-ha
wakka-wakka
...

In some applications, subexpressions are used to extract substrings. For example, Emacs has the functions match-beginnning and match-end which report the positions of strings matched by subexpressions. These functions use the same numbering scheme for subexpressions as backreferences, with the additional rule that subexpression 0 is defined to be the whole regexp.

In some applications, subexpressions are used in string substitution. This again uses the backreference numbering scheme. For example, this sed command:

s/From:.*<\(.*\)>/To: \1/

first matches the line:

From: Joe Schmoe <schmoe@uspringfield.edu>

when it does, subexpression 1 matches "schmoe@uspringfield.edu". The command replaces the matched line with "To: \1" after doing subexpression substitution on it to get:

To: schmoe@uspringfield.edu

A Summary of Regexp Syntax

In summary, regexps can be:

abcd -- matching a string literally

. -- matching everything except NULL

[a-z_?], ^[a-z_?], [[:alpha:]] and [^[:alpha:]] -- matching character sets

\(subexp\) -- grouping an expression into a subexpression.

\n -- match a copy of whatever was matched by the nth subexpression.

The following special characters and sequences can be applied to a character, character set, subexpression, or backreference:

* -- repeat the preceeding element 0 or more times.

\+ -- repeat the preceeding element 1 or more times.

\? -- match the preceeding element 0 or 1 time.

{m,n} -- match the preceeding element at least m, and as many as n times.

regexp-1\|regexp-2\|.. -- match any regexp-n.

A special character, like . or * can be made into a literal character by prefixing it with \.

A special sequence, like \+ or \? can be made into a literal character by dropping the \.

Ambiguous Patterns

Sometimes a regular expression appears to be ambiguous. For example, suppose we compare the pattern:

begin\|beginning

to the string

beginning

either just the first 5 characters will match, or the whole string will match.

In every case like this, the longer match is preferred. The whole string will match.

Sometimes there is ambiguity not about how many characters to match, but where the subexpressions occur within the match. This can effect extraction functions like Emacs' match-beginning or rewrite functions like sed's s command. For example, consider matching the pattern:

b\(\[^q]*\)\(ing\)?

against the string

beginning

One possibility is that the first subexpression matches "eginning" and the second is skipped. Another possibility is that the first subexpression matches "eginn" and the second matches "ing".

The rule is that consistant with matching as many characters as possible, the length of lower numbered subexpressions is maximized in preference to maximizing the length of later subexpressions.

In the case of the above example, the two possible matches are equal in overall length. Therefore, it comes down to maximizing the lower-numbered subexpression, \1. The correct answer is that \1 matches "eginning" and \2 is skipped.

Complications in Subexpression Matching

Sometimes a subexpression matches a substring of no characters. This happens when `f\(o*\)' matches the string `fum'. (It really matches just the `f'.) In this case, both of the offsets identify the point in the string where the null substring was found. In this example, the offsets are both 1.

Sometimes the entire regular expression can match without using some of its subexpressions at all--for example, when `ba\(na\)*' matches the string `ba', the parenthetical subexpression is not used. When this happens, regexec stores -1 in both fields of the element for that subexpression.

Sometimes matching the entire regular expression can match a particular subexpression more than once--for example, when `ba\(na\)*' matches the string `bananana', the parenthetical subexpression matches three times. When this happens, regexec usually stores the offsets of the last part of the string that matched the subexpression. In the case of `bananana', these offsets are 6 and 8.

But the last match is not always the one that is chosen. It's more accurate to say that the last opportunity to match is the one that takes precedence. What this means is that when one subexpression appears within another, then the results reported for the inner subexpression reflect whatever happened on the last match of the outer subexpression. For an example, consider `\(ba\(na\)*s \)*' matching the string `bananas bas '. The last time the inner expression actually matches is near the end of the first word. But it is considered again in the second word, and fails to match there. regexec reports nonuse of the "na" subexpression.

Another place where this rule applies is when the regular expression `\(ba\(na\)*s \|nefer\(ti\)* \)*' matches `bananas nefertiti'. The "na" subexpression does match in the first word, but it doesn't match in the second word because the other alternative is used there. Once again, the second repetition of the outer subexpression overrides the first, and within that second repetition, the "na" subexpression is not used. So regexec reports nonuse of the "na" subexpression.